Energy Landscapes for Base-Flipping in a Model DNA Duplex

We explore the process of base-flipping for four central bases, adenine, guanine, cytosine, and thymine, in a deoxyribonucleic acid (DNA) duplex using the energy landscape perspective. NMR imino-proton exchange and fluorescence correlation spectroscopy studies have been used in previous experiments to obtain lifetimes for bases in paired and extrahelical states. However, the difference of almost 4 orders of magnitude in the base-flipping rates obtained by the two methods implies that they are exploring different pathways and possibly different open states. Our results support the previous suggestion that minor groove opening may be favored by distortions in the DNA backbone and reveal links between sequence effects and the direction of opening, i.e., whether the base flips toward the major or the minor groove side. In particular, base flipping along the minor groove pathway was found to align toward the 5′ side of the backbone. We find that bases align toward the 3′ side of the backbone when flipping along the major groove pathway. However, in some cases for cytosine and thymine, the base flipping along the major groove pathway also aligns toward the 5′ side. The sequence effect may be caused by the polar interactions between the flipping-base and its neighboring bases on either of the strands. For guanine flipping toward the minor groove side, we find that the equilibrium constant for opening is large compared to flipping via the major groove. We find that the estimated rates of base opening, and hence the lifetimes of the closed state, obtained for thymine flipping through small and large angles along the major groove differ by 6 orders of magnitude, whereas for thymine flipping through small angles along the minor groove and large angles along the major groove, the rates differ by 3 orders of magnitude.

transition states between a pair of minima were obtained using the doubly-nudged elastic band (DNEB) method. [6][7][8] Transition state candidates were further re ned using hybrid eigenvectorfollowing. 9,10 After each cycle of connection-making attempts, a large number of intervening minima and transition states may be found. Next, the missing connection algorithm is employed to construct a priority list of of connection attempts based on an appropriate edge-weight metric. 11,12 After an initial discrete path was found between two endpoints, the stationary point database was further expanded using the PATHSAMPLE 13 program. In the current work further sampling was conducted using the CONNECTPAIRS and UNTRAP 14 schemes implemented within PATH-SAMPLE. CONNECTPAIRS speci es pairs of minima for connection attempts. The main advantage of using the other keyword, UNTRAP, is that by specifying the minima at funnel bottoms faster energy pathways can be e ciently found for all the minima in the funnel. Consequently, an entire funnel lying higher up in the disconnectivity graph may be connected to lower energy. In contrast, CONNECTPAIRS requires individual minima to be speci ed, and connections are attempted only for those minima. However, CONNECTPAIRS was found to be much easier to use during the current work and was extensively employed for further sampling. UNTRAP was also used to connect one of the funnels lying higher up in the disconnectivity graph of adenine.
Once the disconnectivity graphs stopped changing appreciably with additional sampling, and consistent rates (i.e., showing negligible variation with additional sampling) were obtained for base ipping, sampling was assumed to be complete, and the landscape was deemed to be converged. 15

Calculation of free energies
Statistical mechanics is used for calculating macroscopic properties from a microscopic point of view. 26 For example, the free energy F i (T ) associated with a local minimum i at temperature T can be expressed as, 26 where k B is the Boltzmann constant and Z i (T ) is the partition function of minimum i at temperature T . This partition function can be calculated from the Laplace transform of the microcanonical density of states Ω i (E), i.e., 2 where β = 1/k B T . Further, a harmonic approximation to Ω i (E) can be calculated using normal mode vibrational frequencies, 26 where n i = 2N A N B . . . /o i . Here, N A is the number of atoms of type A in the system, A, B, . . . are all the di erent kinds of atoms present in the system, o i is the order of point group, κ is the number of degrees of freedom, Γ (κ) = (κ − 1)!, ν α (i) = ω α (i) /2π, and ω α (i) is the normal mode vibrational frequency. [26][27][28] Using equation 3 in 2, Z i (T ) can be written as

Kinetic properties
The kinetics of a chemical reaction are described using rate constants. The rate constants can be calculated using Transition State Theory (TST), i.e., [33][34][35][36][37] Here, k † i (T ) represents the unimolecular canonical reaction rate constant from minimum i to is the partition function of the transition state, Z i (T ) is the partition function of minimum i, and ∆V is the change in potential energy between the minimum and the transition state.
However, biomolecular reactions usually involve multiple minima-transition state-minima steps. In this case, the system is evolved using a linear master equation 38,39 which assumes that the overall dynamics are Markovian. 2 Markovian dynamics means that the rate of transition from one minimum to the next is independent of the pathway that was taken to reach the initial minimum. 2 The assumption behind this independence is that the reactant has enough time to reach a thermal equilibrium when it arrives at a particular state. Using the linear master equation, the change in occupation probability p a (t) of state a with time t, can be expressed as

S4
where k ab represents the rate constant of the single step reaction from minimum b to a and k ba represents the rate constant from minimum a to b where a and b are geometrically distinct.
Since, the pathway is multi-step and has intervening minima, a steady-state approximation may be used for intermediate minima. 3,40,41 However, biomolecular reactions are rarely single step with the reactant de ned by a single minimum. It is usually necessary to de ne reactant and product using an ensemble of states, i.e., a set of minima. The master equation can now be rewritten for an ensemble of states A as, where p A and p B are the occupation probabilities of A and B, respectively. and The above equation assumes a local equilibrium condition, i.e., The phenomenological rate constants k AB and k BA include sums over all the elementary transitions from a ← b, and b ← a that lie on the boundary of regions de ned by set A and B. 3 To treat a multiple step chemical reaction, the steady-state approximation may be applied for intervening states i. In this case, the evolution of occupation probability for intermediate state i is assumed to be, dp

S5
The steady-state rate constant from B to A (k SS AB ) can now be written as, Equation (13) can be further simpli ed using the de nitions of transition (or branching) probability p ab as given in Equation (14), waiting time or inverse of escape rate or lifetime τ de ned in Equation (15), and then committor probability de ned in Equation (16). 2 The committor probability from a minimum b to A is de ned as the probability that a random walk starting from b will end up in a minimum contained in set A before visiting a minimum in set B. 41 Then, Equation (13) can be rewritten as, Here, B and A are the set of reactant and product minima and b and a are speci c representative local minimum belonging to B and A, respectively.
When the steady-state approximation is not taken into account, i.e., the waiting time in intermediate minima is not negligible, the e ective waiting time in the reactant state b increases and is given by The non-steady state rate constant is de ned as, For multiple multi-step pathways between the reactant minima and product minima the rates are S6 calculated by making a graph out of the stationary point database. In the graph, each minimum is represented by a node and the minima that are directly connected by a transition state are represented using edges between the respective nodes of the minima. 42 The pathway that contributes the most to the rate constant is calculated using Dijkstra's shortest path algorithm. 11 The overall rate constant can be calculated e ciently using Graph Transformation, 41 where p Ab is the renormalised branching probability from b to A and τ b corresponds to the renormalised waiting time for state b.
The linear master equation is solved by combining a matrix formulation with graph transformation, which proves to be even more e cient. 44

Input les and keywords
A sample pathdata input le for extracting fastest pathway between two minima using the PATHSAMPLE program has the following keywords An explanation of the keywords contained in the pathdata and odata.connect les can be found at the PATHSAMPLE 13 and OPTIM 4 websites. Some of these keywords in odata.connect are explained here in the sequence in which they are used to generate a discrete path between two endpoints.

For checking the initial and nal states
• REOPTIMISEENDPOINTS checks the RMS (root mean square) force of the endpoint geometries and reconverges these endpoints if the RMS is above the threshold.
• NOCISTRANS performs chirality and cis-trans checks on the two minima.
• LPERMDIST is used for local permutational alignment of the minima. 45

For connecting endpoints with images and springs
• NEBK is used to specify the spring constant of the springs joining the images.
• The arguments of NEWNEB instruct the OPTIM program about the maximum number of images, the maximum number of iterations allowed, and the convergence criterion for the rst DNEB run.

For optimising the DNEB images using L-BFGS and obtaining transition state guesses
• MAXBFGS speci es the maximum step length.
• BFGSSTEPS is used to de ne the maximum number of steps.
• The rst argument of MAXERISE speci es the threshold energy. Any step that leads to an energy rise above this threshold will be rejected.
• BFGSMIN designates the convergence criterion on RMS gradient.

S9
4. For obtaining a converged transition state from a candidate geometry using Rayleigh-Ritz minimisation and hybrid eigenvector-following • The steps taken during Rayleigh-Ritz minimisation are checked using the trust radius given by TRAD • The second argument of MAXERISE speci es the threshold energy. Any step that causes an energy rise above this threshold is rejected.
• MAXMAX de nes the maximum step size that can be taken in uphill direction during eigenvector-following.
• STEPS is used to specify the maximum number of steps allowed during transition state search using eigenvector-following.
• The di erent arguments of the keyword BFGSTS are employed when L-BFGS is used for energy minimisation in the orthogonal subspace during HEF.
• The geometry and energy of transition state is further checked using NOCISTRANS, LPERMDIST, GEOMDIFFTOL and EDIFFTOL.

For connecting a transition state to two local minima
• MAXSTEP de nes the maximum step length that can be taken along the direction of negative eigenmode.
• PUSHOPT is used to specify the magnitude of individual steps calculated using Goldensection search, maximum iterations allowed, and convergence criterion.
• The minima obtained are checked for energy and geometry using EDIFFTOL, GEOMD-IFFTOL, NOCISTRANS and LPERMDIST.

For choosing minima for subsequent newconnect cycles
• DIJKSTRA EXP is used to choose appropriate minima to obtain the shortest path between reactant and product minima. S10 7. A new connection cycle is attempted on the chosen minima using NEWCONNECT keyword.
The various arguments speci ed with this keyword are described below.
• • igb=2 speci es the use of the Onufriev, Bashford and Case generalised Born implicit solvent model, which is the recommended solvent model for nucleic acids. 47 Although new solvent models have been introduced within AMBER they have been parameterised for proteins only.
• ntb=0 de nes the non-periodic boundary conditions during the calculation of non-bonded interactions. This setting is recommended when using igb=2.
• rgbmax=25.0 de nes that the atoms within 25 Angstroms will be considered as atomic pairs and will be used during the e ective Born radii calculation.
• saltcon=0.1 speci es that the concentration of 1-1 (monovalent) mobile ions in the implicit solvent is 0.1 molar. The inclusion of these ions leads to the screening of electrostatic interactions. 48 S11 • cut=999 speci es the maximum distance in Angstroms that will be used while nding non-bonded (electrostatic and van der Waals) interactions.
The CPPTRAJ 49 script used for calculation of CPDb dihedral in the DNA duplex.   Figure S2: Free energy disconnectivity graphs for (a) adenine, (b) thymine, (c) guanine and (d) cytosine bases ipped out of a DNA duplex at 300 K. These graphs are the same as those given the main text and the order parameter represents the CPDb dihedral angle values. Here, these graphs are recolored to distinguish between bases ipped out via major (negative CPDb dihedral angle) and minor groove (positive CPDb dihedral angle) pathway.